Data classification based on point-of-view dependency

ABSTRACT

Data classification is used to classify input items by associating the input items with one or more classes from a set of one or more classes in a data classification system, including identifying relevant features in an input item to form a feature vector for the input item, receiving at the data classification system an indication of a point-of-view, adjusting the feature vector according to the point-of-view indication or modifying a pattern discriminator (e.g., trainer and classifier) to inline-process feature vectors depending on the provided point-of-view (e.g., SVM custom kernels), and classifying the input item into the set of classes according to the point-of-view. The point-of-view data can be introduced either as a pre-process step prior to passing it off to the pattern discrimination algorithm, or can be incorporated directly into the pattern discrimination algorithm if applicable. The pattern discrimination algorithms can detect arbitrary patterns given a similarly prepared dataset during both training and subsequent classification of unclassified documents.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a Continuation of U.S. application Ser. No. 10/931,291, filed Aug. 30, 2004, entitled “DATA CLASSIFICATION BASED ON POINT-OF-VIEW DEPENDENCY,” (Attorney Docket No. 021389-000410US) now allowed, which claims priority from co-pending U.S. Provisional Patent Application No. 60/499,196 filed Aug. 28, 2003 entitled DATA CLASSIFICATION BASED ON POINT-OF-VIEW DEPENDENCY, all of which are hereby incorporated by reference, as if set forth in full in this document, for all purposes.

FIELD OF THE INVENTION

The present invention relates to automated data classification in general and data classifiers of documents based on content in particular.

BACKGROUND OF THE INVENTION

Data classification systems are useful in many applications. One application is in filtering data, as might be done as part of a search over a corpus of data. While many data structures might be used with a data classification system, a typical example is a corpus containing many, many data items organized as units such as records or documents. While a document is used as an example of a data item, it should be understood that statements might be equally applicable to data items that are not normally referred to as documents.

A data classification system might be used to filter documents from a large corpus to flag or otherwise identify relevant documents distinctly from less relevant documents. As an example, a company or an analyst might want to review news items from a large corpus of news items, but only those news items that relate to a particular company or set of companies. They could use a data classification system to flag news items that relate to the companies of interest and provide those relevant documents for further processing, such as manual review.

In the general case, a data classification system classifies documents as being “in” or “not in” a particular class, or classifies documents as being in one or more of two or more classes. In an extremely simple data classification system, a class might be “all documents containing phrase P” and the simple data classification system classifies each document as either being in the class or not being in the class (binary decision). In other simple, but slightly more involved data classification systems, the class might be “all documents mentioning phrase P or its synonyms” or the class might be “all documents apparently relating to topic T”.

A conventional data classification system might first convert documents into enumerated features through a process of feature generation. One way that this can be done is to tokenize text into a distinct dictionary of features with associated enumerated values. Advanced techniques may pre-process text with grammatical knowledge to enrich tokens in a way to aid in a classification task (e.g., part-of-speech (POS) tagging, negation prefixing, etc.). “Stop” words (“a”, “the”, “but”, “and”, etc.) are often removed to improve efficiency. With each document distilled to a set of enumerated features, the data classification system can then perform feature selection, selecting a subset of features that either enhance, or at least minimize loss of, the information content of the document. Arguably, feature selection is primarily performed for efficiency reasons, as many machine learning algorithms display non-linear efficiency with respect to the number of distinct features.
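The feature generation step described above can be sketched as follows. This is a minimal illustration only, not the implementation of any embodiment: the function names `tokenize` and `generate_features`, the regular expression used for tokenizing, and the four-word stop list are all assumptions chosen for brevity.

```python
import re
from collections import Counter

# Illustrative stop list only; a practical list would be much longer.
STOP_WORDS = {"a", "the", "but", "and"}

def tokenize(text):
    """Lowercase the text and split it into word tokens (hypothetical rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def generate_features(text, dictionary):
    """Map each non-stop token to an enumerated feature id and count it.

    `dictionary` maps token -> feature id and is extended in place, so the
    same ids are reused across every document passed through this function.
    """
    counts = Counter()
    for token in tokenize(text):
        if token in STOP_WORDS:
            continue
        feature_id = dictionary.setdefault(token, len(dictionary))
        counts[feature_id] += 1
    return dict(counts)

dictionary = {}
features = generate_features("The merger and the layoffs: the merger closed.", dictionary)
```

Sharing one dictionary between the training corpus and the documents to be classified keeps the enumerated values consistent between training and classification.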

The selected features can be weighted (which can also be thought of as a “soft” feature selection, where some features are selected strongly and other features are selected weakly), to enhance a machine learning algorithm. An example of feature weighting is the use of Inverse Document Frequency (IDF), wherein terms get more weight if they occur in a document more frequently than their general average in a wider corpus and less weight if they appear less frequently than that average.
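One common textbook formulation of this kind of weighting multiplies a term's count in a document by log(N / df), where N is the number of documents in the wider corpus and df is the number of those documents containing the term. The sketch below uses that formulation as an assumption; the text above does not specify an exact formula, and the function names are hypothetical.

```python
import math

def inverse_document_frequency(corpus):
    """Return {term: log(N / df)} for a corpus given as a list of token lists."""
    n_docs = len(corpus)
    doc_freq = {}
    for tokens in corpus:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tf_idf(tokens, idf):
    """Weight each term by (count in this document) * (its IDF).

    Terms absent from the reference corpus get weight 0.0 (an assumption).
    """
    weights = {}
    for term in tokens:
        weights[term] = weights.get(term, 0.0) + idf.get(term, 0.0)
    return weights

corpus = [["layoffs", "company"], ["merger", "company"], ["company", "profit"]]
idf = inverse_document_frequency(corpus)
weights = tf_idf(["layoffs", "layoffs", "company"], idf)
```

Under this formulation a term that appears in every document of the wider corpus, such as "company" here, receives zero weight, while a term concentrated in the current document receives a high weight.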

The above processes can be done on documents in a training corpus as well as documents in the corpus that are to be classified. Training might involve providing the data classification system with a corpus and classifications for each document in the training corpus. Thus, for a simple binary classification process, some of the documents in the training corpus are tagged as being examples of members of the class while the others are tagged as being counterexamples.

The data classification system then operates a training process wherein discriminating patterns are preferably discovered in the training corpus between the examples and the counterexamples. Techniques for pattern discrimination have been studied in considerable detail. Examples of machine learning classification techniques include, but are not limited to, Naïve Bayes, Support Vector Machines, Maximum Entropy, and k-nearest neighbor. Others might be found in use or in literature on the topic.

More complex data classification systems have been developed. For example, instead of simply classifying an input document as being an example of a member of the class or a counterexample (a binary classification), the input document might be classified into one of more than just two possibilities (M-ary classification into M classes). For example, when evaluating news stories, a simple data classification system might just make a binary decision as to whether a particular news story refers to topic T or not, while a more complex data classification system might define each class as relating to a particular topic and would classify the input document into one or more of two or more classes.

Data classification systems might make hard decisions as to how to classify a given input document. Some data classification systems might make soft decisions, wherein an input document is not necessarily classified into a class with absolute certainty, but it is tagged with one or more value(s) indicating the degree(s) to which the document would be associated with each of one or more classes.

One problem with existing data classification systems is that real-world examples might be more involved, and items might be classified differently depending on other considerations. Hence, there is a considerable need in the art for a more sophisticated classification system capable of classifying items based on multiple inputs into multidimensional categories.

BRIEF SUMMARY OF THE INVENTION

One aspect of the present invention provides a method of data classification, wherein an input item is classified by associating the input item with one or more classes from a set of one or more classes in a data classification system, including identifying relevant features in an input item to form a feature vector for the input item, receiving at the data classification system an indication of a point-of-view, adjusting the feature vector according to the point-of-view indication or modifying a pattern discriminator (e.g., trainer and classifier) to inline-process feature vectors depending on the provided point-of-view (e.g., SVM custom kernels), and classifying the input item into the set of classes according to the point-of-view. The point-of-view data can be introduced either as a pre-process step prior to passing it off to the pattern discriminator, or can be incorporated directly into the pattern discriminator if it is applicable (e.g., custom kernels in a support vector machine could be enhanced with point-of-view data). The pattern discriminator can detect arbitrary patterns given a similarly prepared dataset during both training and subsequent classification of unclassified documents.

Some advantages of such a system include improved accuracy within a given classification problem, as it focuses the pattern discrimination engine on the correct context given a point-of-view to operate from. Another advantage is improved performance when a given trained model is applied to new points-of-view not incorporated in the original training. This can be the result of methodologies focusing on the features related to the point-of-view while having the effect of abstracting the point-of-view itself.

Another aspect of the present invention provides a data classification system, wherein the system includes at least one input item, at least one feature vector, and at least one data classifier defined by point-of-view dependency, wherein the system uses feature weighting in order to rate and classify input items. The data classifier classifies one or more data sets based upon patterns observed during a training process with one or more training data sets. In addition, the data classification system may rely on a mathematical engine, such as a support vector machine, to engage in feature weighting.

As used herein, each item to be classified is described as a document, but it should be understood that items that are classified are not limited to items that are considered documents in all contexts. For example, an input item may include, but is not limited to, a word processing document, a file of a particular format (e.g., ASCII file, XML file, UTF-8 file, etc.), a collection of documents with some structural organization, an image, text, a combination of images and text, media, spreadsheet data, a collection of bytes, or other organizations of data or data streams. A data classification system is provided with access to one or more of the items of the corpus and based upon the analysis of the one or more items, the data classification system arrives at a determination about each of the one or more items, where an example determination is whether or not a given item belongs in a particular class. In some cases, the data classification system is “trained” using examples, wherein the data classification system is provided with several example items and an indication, for each of the example items, of the classification for those example items.

The invention further encompasses data classifiers that classify received data sets based upon specific patterns. These patterns are observed during a training process with training data sets.

A further understanding of the nature and advantages of the inventions herein may be realized by reference to the remaining portions of the specification and the attached drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is best understood when read in conjunction with the accompanying figures which serve to illustrate the preferred embodiments. It is understood, however, that the invention is not limited to the specific embodiments disclosed in the figures.

FIG. 1 illustrates a classifier with point-of-view dependencies according to aspects of the present invention.

FIG. 2 illustrates a data classification system according to aspects of the present invention.

FIG. 3 illustrates a trainer in a data classification system according to aspects of the present invention.

FIG. 4 illustrates a classifier in a data classification system according to aspects of the present invention.

FIG. 5 illustrates a system wizard usable for accepting user input of a point of view for a training session.

FIG. 6 illustrates a system wizard usable for accepting user input on variances of the predicted sentiment for an automatically selected set of articles.

DETAILED DESCRIPTION OF THE INVENTION

(a) Definitions and General Parameters

The following definitions are set forth to illustrate and define the meaning and scope of the various terms used herein.

The terms “input item” and “document” are interchangeably used herein and refer to any item that can be used in conjunction with the present classification method. For example, an input item may include, but is not limited to, a word processing document, a file of a particular format (e.g., ASCII file, XML file, UTF-8 file, etc.), a collection of documents with some structural organization, an image, text, a combination of images and text, media, spreadsheet data, a collection of bytes, or other organizations of data or data streams.

The term “relevant feature” refers to a uniquely identifiable attribute that could affect the detection of patterns within a corpus. Relevant features might be domain specific. For example, in the case of English text classification, a relevant feature might be the presence of a unique word within a document, regardless of position.

A “feature vector”, as used herein, refers to a list of features describing an instance, wherein a feature is the specification of an attribute and its value. For example, in the case of English text classification, the attribute of a feature might be a unique word within a document and the value of the feature might be the number of occurrences of the unique word within the document.

The term “classifier” refers to a system, apparatus or code for mapping from unlabeled instances to discrete classes. Classifiers may use a mapping form (e.g., decision tree) and an interpretation procedure, including rules for how to handle unknowns. Some classifiers might also provide probability estimates or scores. These scores can be evaluated to yield a discrete decision.

The term “trainer” refers to a system, apparatus or code for examining a set of known labeled instances to detect implicit patterns and create models. Classifiers can then apply these models to future unlabeled instances to generate discrete classes (see “classifier”).

The terms “point-of-view” and “POV” are interchangeably used herein and refer to a variable frame of reference when examining or processing a current document. The same instance may be placed in different classes given different points-of-view.

(b) Data Classifiers with Point-of-View Dependencies

FIG. 1 is a block diagram of a classifier with point-of-view dependencies according to embodiments of the present invention. Using the novel data classification systems and methods described herein, input documents can be classified into classes for a given point of view (point-of-view dependency). In many cases, a corpus might be divided into classes one way for one point-of-view and would be divided into those classes differently for a different point-of-view. As shown in FIG. 1, classifier 40 might classify an input document 15 into class 301 or 302 depending on a given point of view 201 or 202. This allows for improved data assessment over a conventional data classification system that might always classify a document into an example or a counterexample (or one or more of a plurality of classes in M-ary classification).

An example illustrating the instant system might be the collection of documents regarding a lawsuit. The documents will likely contain references to a defendant (e.g., company A) and several side references to other companies (e.g., companies B-D). From the point-of-view of company A, the documents should be classified as belonging to a lawsuit class that company A may track and analyze daily. From the point of view of companies B-D, the very same documents would not be considered lawsuit documents (i.e., classified under a lawsuit class) because, from their perspective, the documents are not about a lawsuit concerning companies B-D.

In another example, an analyst might be searching a news report corpus for articles about layoffs and the data classification system might classify incoming articles as being about layoffs, or not about layoffs. Given a document with two threads of discussion, one regarding layoffs at Company A and one regarding a merger at Company B, a traditional data classification system would only recognize that the article is about layoffs regardless of whether it concerns Company A or Company B. However, in embodiments of a novel data classification system as described herein, this distinction is easily made. The system is trained against the difference and would be able to, when given the Company A point-of-view, correctly classify the document as Layoffs and, when given the Company B point-of-view, correctly classify the same document as NOT-Layoffs. The classification of “sentiment” or “favorability” is a natural application, i.e., a given document may be “favorable” for one company and “unfavorable” for another company.

FIG. 2 is a high-level block diagram illustrating a data classification system 10 according to the present invention. System 10 accepts a training corpus, such as training corpus 12, for training the data classification system. In one embodiment, the training corpus 12 contains labeled documents. For a simple binary data classification system in this embodiment, some of the documents in the training corpus are tagged as being examples of members of the class while the others are tagged as being counterexamples. The data classification system 10 then operates a trainer 20 wherein discriminating patterns are discovered in the training corpus 12 between the examples and counterexamples to generate a model 30. Examples of pattern discrimination techniques include, but are not limited to, Naïve Bayes, Support Vector Machines, Maximum Entropy, and k-nearest neighbor.

When a new input document 15 is presented, system 10 operates a classifier 40, which classifies document 15 using model 30 and generates a predicted class 50. For example, in a simple binary classification system, document 15 is classified as either being in the class or not being in the class.

Point of view (POV) 60 includes context sensitive information to enable trainer 20 to discriminate a single classification between multiple points-of-view. Similarly, POV 70 is an input to classifier 40 to enable it to generate a single classification between multiple points-of-view.

FIG. 3 shows trainer 20 in greater detail. Trainer 20 accepts a training corpus 12 as its input and outputs a model 30 with discerned patterns in training corpus 12. In the figure, trainer 20 is shown with a feature generator 1, feature selector 2, feature weighter 3, and a pattern discriminator 4. Feature generator 1 generates, from input corpus 12, features to be considered for discrimination. With each document in training corpus 12 distilled to a set of enumerated features, trainer 20 can then operate feature selector 2 to select a subset of features that either enhance, or at least minimize loss of, the information content of the document. Feature selection is primarily performed for efficiency reasons, as many pattern discrimination techniques display non-linear efficiency with respect to the number of distinct features. Feature weighter 3 then weights the selected features to enhance a pattern discrimination algorithm. Finally, a pattern discriminator 4 is run to discriminate patterns within the training corpus, where the pattern discriminator is optionally provided with a custom kernel when the pattern discriminator supports it. Point-of-view information 60 can be introduced in various components of trainer 20 to discriminate a single classification between multiple points-of-view.

FIG. 4 shows classifier 40 in greater detail. Classifier 40 accepts an input document 15 as its input and outputs a predicted class 50 for input document 15. In the figure, classifier 40 is shown with a feature generator 5, a feature selector 6, a feature weighter 7, and a model applier 8. Feature generator 5 generates features from input document 15 to be considered for discrimination. After input document 15 is distilled into a set of enumerated features, classifier 40 can then operate a feature selector 6 to select a subset of features that either enhance, or at least minimize loss of, the information content of the document. Feature weighter 7 then weights the selected features. Finally, a model applier 8 applies model 30 to input document 15 to predict the class of input document 15, where a custom kernel is optionally provided to model 30. Predicted class 50 is produced optionally with confidence values for input document 15. Point-of-view information 70 can be introduced in various components of classifier 40 to enable it to generate a single classification between multiple points-of-view.

Uses of POV information include, but are not limited to, custom feature generation, feature selection, feature weighting and custom kernel generation. Custom feature generation based on POV could generate additional features not normally generated in traditional classification systems where the additional features may be indicative of a relationship between a given POV and a conventional feature. Feature selection might be based on POV, wherein features that appear unrelated to the POV are stripped out from the vector. Feature weighting might also be based on POV, wherein features are weighted based on relationship (i.e., the value associated with the attribute is modified in cases where the pattern discrimination engine supports it). Similarly, a custom kernel might be created when a pattern discriminator supports it (e.g., Support Vector Machine (SVM)). The custom kernel can apply POV weighting of features dynamically during training and classification.
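As one hedged illustration of a custom kernel that applies POV weighting dynamically, the sketch below computes a linear (dot-product) kernel between two sparse feature vectors after scaling each feature value by a per-document POV weight. The function name, the dictionary-based vector representation, and the choice of a linear kernel are assumptions made for illustration; the description above does not prescribe a particular kernel.

```python
def pov_weighted_linear_kernel(x, x_weights, y, y_weights):
    """Dot product of two sparse feature vectors after POV weighting.

    x, y: dicts mapping feature -> value (e.g., occurrence count).
    x_weights, y_weights: dicts mapping feature -> POV weight for that
    document; a feature missing from a weight dict is treated as having
    weight 0.0 (an assumption of this sketch).
    """
    total = 0.0
    for feature, x_value in x.items():
        if feature in y:
            wx = x_weights.get(feature, 0.0)
            wy = y_weights.get(feature, 0.0)
            total += (wx * x_value) * (wy * y[feature])
    return total

doc_a = {"layoffs": 2, "merger": 1}
doc_a_weights = {"layoffs": 1.0, "merger": 0.5}
doc_b = {"layoffs": 1, "profit": 3}
doc_b_weights = {"layoffs": 0.8, "profit": 1.0}
similarity = pov_weighted_linear_kernel(doc_a, doc_a_weights, doc_b, doc_b_weights)
```

Because the result equals an ordinary dot product of the two weighted vectors, this particular form remains a valid kernel; the weights can be recomputed per point-of-view without regenerating the underlying feature vectors.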

(c) Examples

The following specific examples are intended to illustrate embodiments of data classification systems according to aspects of the invention and should not be construed as limiting the scope of the claims.

(i) Point-of-View (POV) Sentiment Ratings

A novel data classification system might employ POV sentiment ratings. These ratings capture a person's or organization's point-of-view on any sentiment (e.g., article sentiment) using a positive, neutral, and negative (3-point) scale. In this system, documents can be classified as positive, neutral or negative documents, with respect to a particular POV. This can be used to provide an automated point-of-view sentiment classification service. Where a document must be classified as positive, neutral or negative, the data classification system performs ternary classification. In other variations, gradations of positive and negative are possible, yielding more than three classes to choose from, e.g., “strongly positive”, “positive”, “slightly positive”, “neutral”, etc. In order to achieve the ternary classification, the data classification system is trained with a training corpus wherein each document in the corpus is labeled with its class. The data classification system can then create three models: a positive model, a neutral model, and a negative model. This process can be expanded to classification into more than three classes.

The ratings extend a particular point-of-view to all articles for a specific subject such as a company, competitor, or the like. The user can gain business insights from enhanced sentiment reports as well as sentiment report filters in all other reports. There are many examples of how the user may gain important business insights by using the systems described herein. For example, a user employed by company X may need to know who the top authors are that are currently writing negative articles (i.e., negative from company X's perspective) about a competitive lawsuit, wherein the user is particularly interested in all articles written during the last week (i.e., top authors/negative articles/lawsuit company topic/seven days rating). Alternatively, the user may need to find out how editorial coverage opinion is changing with respect to company X's handling of a specific crisis (i.e., sentiment over time/crisis company topic rating). In yet another scenario, the user may need to investigate what types of publications contain positive articles about company X's recent product launch (top publications/positive articles/product launch company topic rating).

Sentiment ratings can be automatically applied as articles enter the system, i.e., without human intervention. More specifically, the sentiment ratings might work through a system wizard that captures the point-of-view for a subject while a person reviews and validates ratings such as ratings on articles. FIG. 5 illustrates a step of a system wizard, which asks a user to choose a point of view for a training session. For example, the user is asked to choose among Microsoft, Sun Microsystems, Hewlett-Packard, Dell, Gateway, Apple, and Sony Electronics as the point of view for the training session. FIG. 6 shows another step of the system wizard, which asks a user to confirm, correct or ignore the predicted sentiment for an automatically selected set of articles. For example, the user is asked to review positive predictions for a set of articles for Microsoft. In a practical setting, new or revised ratings can be applied overnight to all articles for a subject within the system. New articles receive ratings as they enter the system.

In one embodiment, the data classification systems and methods of the instant invention employ event-based machine learning, including an advanced pattern recognition engine, point-of-view capture algorithms, pre-population with a large corpus of rated events, and closed-loop learning for continued point-of-view learning. Generally, the more the system is taught, the more the system knows. Hence, the wizard can be run multiple times (i.e., the system can be trained repeatedly), which improves rating consistency.

The wizard can be run any time the user desires to tune sentiment ratings, and a new point-of-view may be applied to an entire user account history in batch mode. Moreover, manual ratings and individual article manual overrides can be incorporated into new ratings going forward. Under specific circumstances, it may be necessary to rerun the wizard, particularly when there is a dramatic change in the article profiles. For example, if company X is confronted with a new crisis or company X changes from being a private company to being a public company, the wizard may have to be rerun and the system thus retrained.

After training, the data classification system can be used to predict classification of an unlabeled instance. For example, in a ternary classification system, each of the three models is applied in order to predict classification, and a confidence number is returned from each model's classification. Since the three models may disagree (two or more models could claim that the instance is in that model's class), a weighting scheme is applied amongst the three models to break disagreement and produce the single predicted class.
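The text does not specify the weighting scheme used to break disagreement. The sketch below shows one plausible form, offered purely as an assumption: each model's confidence number is multiplied by a per-model weight and the class with the highest weighted confidence is selected. The function name `predict_class` and the example weights are hypothetical.

```python
def predict_class(confidences, model_weights=None):
    """Pick a single class from per-model confidence numbers.

    confidences: dict mapping class label -> confidence returned by that
    class's model. model_weights: optional dict mapping class label -> weight;
    labels without an entry default to 1.0. Exact ties are resolved in favor
    of the label that appears first in `confidences` (an arbitrary choice).
    """
    if model_weights is None:
        model_weights = {}
    scored = {
        label: confidence * model_weights.get(label, 1.0)
        for label, confidence in confidences.items()
    }
    return max(scored, key=scored.get)

# Two models claim the instance: positive (0.7) and negative (0.6).
confidences = {"positive": 0.7, "neutral": 0.2, "negative": 0.6}
unweighted = predict_class(confidences)
weighted = predict_class(confidences, {"negative": 1.5})
```

With equal weights the positive model wins; boosting the negative model's weight to 1.5 reverses the decision, which shows how the weights could be tuned to favor models that prove more reliable in validation.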

In tests, one implementation was 77%-97% accurate depending on the scenario. This was generated by training on ⅔ of a labeled corpus and testing classification against the remaining ⅓ of the labeled corpus. When a conventional process stack was applied against the same problem, in-corpus accuracy dropped significantly (10-30%), and cross-corpus accuracy (application of a trained model to a new corpus in a different domain) fell to statistically insignificant levels (i.e., the results were no more accurate than random guesses).

(ii) SVM Pattern Discrimination

In one example, a Support Vector Machine (SVM) pattern discrimination algorithm was chosen for classification. SVMs are capable of operating efficiently on large feature spaces, which reduces the need to modify feature vectors for efficiency reasons. In addition, SVMs support the concept of “weighting” feature vectors, which was the approach initially used.

In the feature weighting scheme that was examined, one can apply a relatively unsophisticated algorithm of weighting. Given a bag of aliases representing a simplistic Point-of-View (e.g., “IBM”, “International Business Machines”, “Big-Blue”), one can weight all features against sentence-level proximity to an alias within the bag. In the test-case, a feature was weighted according to the number of sentences away from the nearest alias, using the formula shown in Equation 1, where FSP is feature-to-sentence proximity going forward from the alias and BSP is feature-to-sentence proximity going backward from the alias.

FeatureWeight = Max(0.95^(FSP), 0.80^(BSP))  (Equ. 1)

Using the formula of Equation 1 had the effect of giving more weight to features closer to the point-of-view, with more weight given for a proximity forward of the POV and less weight given for a proximity prior to the POV. Using that equation, documents were distinguishable on the basis of POV.
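Equation 1 can be sketched in code as follows. The constants 0.95 and 0.80 come from the equation; everything else is an assumption of this sketch: sentences are identified by zero-based index, an alias is detected by case-insensitive substring match, a direction with no alias contributes nothing to the maximum, and a document with no alias at all yields a weight of 0.0. The function names are hypothetical.

```python
def alias_sentence_indices(sentences, aliases):
    """Return indices of sentences that mention any alias in the bag."""
    lowered = [alias.lower() for alias in aliases]
    return [
        index
        for index, sentence in enumerate(sentences)
        if any(alias in sentence.lower() for alias in lowered)
    ]

def feature_weight(feature_sentence, alias_sentences):
    """Weight a feature by sentence-level proximity to the nearest alias.

    FSP: sentences from the nearest alias at or before the feature, going
    forward to the feature. BSP: sentences from the nearest alias at or
    after the feature, going backward to the feature.
    """
    forward = [feature_sentence - a for a in alias_sentences if a <= feature_sentence]
    backward = [a - feature_sentence for a in alias_sentences if a >= feature_sentence]
    candidates = []
    if forward:
        candidates.append(0.95 ** min(forward))   # 0.95^FSP
    if backward:
        candidates.append(0.80 ** min(backward))  # 0.80^BSP
    return max(candidates, default=0.0)

sentences = [
    "IBM announced layoffs.",
    "Analysts reacted.",
    "Big-Blue declined comment.",
]
bag = ["IBM", "International Business Machines", "Big-Blue"]
alias_positions = alias_sentence_indices(sentences, bag)
weight_for_middle_sentence = feature_weight(1, alias_positions)
```

For a feature in the middle sentence, the alias one sentence earlier gives 0.95^1 and the alias one sentence later gives 0.80^1, so the forward term determines the weight; a feature in the same sentence as an alias receives the full weight of 1.0.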

Thus, adding context sensitive information to the feature vector enables the mathematical engine to discriminate a single classification between different POVs. This information can be added in any of the components in FIG. 3 and FIG. 4. In this example, it is done within the “feature weighter”; however, it is equally applicable to all components.

Various modifications and variations of the present invention will be apparent to those skilled in the art without departing from the scope and spirit of the invention. Although the invention has been described in connection with specific preferred embodiments, it should be understood that the invention as claimed should not be unduly limited to such specific embodiments. Indeed, various modifications of the described modes for carrying out the invention which are obvious to those skilled in the art are intended to be within the scope of the claims.

1. A method of classification, wherein an input item is classified by associating the input item with one or more classes from a set of classes in a data classification system, said method comprising the steps of: receiving the input item to be classified; identifying relevant features in the input item to form a feature vector for the input item; receiving an indication of a point of view at the data classification system; adjusting the feature vector or modifying a pattern discriminator according to the point-of-view indication; and classifying the input item into the set of classes according to the point-of-view.
2. The method of claim 1, wherein the step of adjusting the feature vector comprises generating custom features.
3. The method of claim 1, wherein the step of adjusting the feature vector comprises selecting a subset of features.
4. The method of claim 1, wherein the step of modifying a pattern discrimination algorithm comprises generating a custom kernel.
5. The method of claim 1, wherein the step of adjusting the feature vector comprises weighting features.
6. The method of claim 5, wherein weighting features uses proximity weighting.
7. The method of claim 6, wherein proximity weighting calculates weight of a feature as the maximum of 0.95 raised to the power of FSP and 0.80 raised to the power of BSP, wherein FSP is the number of sentences going forward from a nearest alias to the feature and BSP is the number of sentences going backward from a nearest alias to the feature, wherein an alias is a representation of a point-of-view.
8. The method of claim 1, wherein the input item is selected from the group consisting of a word processing document, an ASCII file, an XML file, a UTF-8 file, a collection of documents with some structural organization, an image, a text, a combination of images and text, media, spreadsheet data, a collection of bytes, an organization of data and a data stream.
9. A data classification system comprising at least one input item, at least one feature vector, and at least one data classifier defined by point-of-view dependency, wherein the data classification system is configured to perform one or more of feature generation, feature selection, feature weighting, and custom kernel generation in order to rate and classify the input item.
10. The data classification system of claim 9, wherein the input item is selected from the group consisting of a word processing document, an ASCII file, an XML file, a UTF-8 file, a collection of documents with some structural organization, an image, a text, a combination of images and text, media, spreadsheet data, a collection of bytes, an organization of data and a data stream.
11. The data classification system of claim 9, wherein the data classifier classifies one or more data sets based upon patterns observed during a training process with one or more training data sets.